Blog

Anomalo’s Unstructured Data Monitoring: Private Beta

July 17, 2024

Unstructured Data Monitoring in Private Beta Preview

We’re certainly not alone in the opinion that Generative AI (GenAI) will be transformative for the enterprise. But for all its power, it’s sometimes unruly, serving up hallucinations, regurgitating private data, and simply reaching the wrong conclusions.

Much of the effort to corral GenAI has been focused on techniques like prompt engineering and filtering to nudge large language models (LLMs) toward better outcomes. While those efforts are vital, at Anomalo, we look earlier in the process to make further improvements. Businesses will get better outcomes from AI by using higher-quality data and proactively addressing data challenges that may lead to privacy, security, and performance risks.

There’s one big issue, though: by commonly cited estimates, about 90% of enterprise data is unstructured. Because unstructured data doesn’t adhere to traditional standardized formats, enterprises have significant challenges with organizing, retrieving, and analyzing it. Unstructured data often contains content you don’t want an LLM to learn from, including personally identifiable information (PII), company intellectual property, and abusive language.

To help customers address these challenges, Anomalo has launched monitoring for unstructured data, now in a private beta preview. Here’s why it’s crucial to pay attention to the quality of the data you preprocess for your AI workflow, and why working with Anomalo will give you a big head start on the challenge.

Data quality can make or break enterprise GenAI

The press is full of stories of GenAI gone wrong. Two examples: Air Canada’s chatbot invented a discount, and Samsung banned GenAI due to a data leak.

Researchers and companies worldwide are working on ways to make GenAI less wrong, offensive, and sloppy. One thing we’re certain of: GenAI can’t expose a secret, or draw a conclusion from a wrong premise, if it was never exposed to that secret or premise in the first place.

Enterprise data quality requirements insert an extra layer of complication beyond the hallucinations and misfires of an off-the-shelf LLM. Whether you’re training a model from scratch, fine-tuning a model for a custom application, injecting your documents into a context window, or using RAG to scale to very large document collections, you’re responsible for the business and technical consequences of GenAI outputs.

Just like you are what you eat, your GenAI is what it’s been taught. While it’s thrilling to finally make use of all of the unstructured data that you’ve been keeping because it might be useful someday, think twice before dumping years of tweets, purchase histories, customer service transcripts, and feedback surveys into your preprocessing queue.

Thankfully, we’re not alone in recognizing the issue. According to an AWS/MIT-sponsored survey quoted in the Harvard Business Review, among CDOs and other data leaders, “46% identified ‘data quality’ as the greatest challenge to realizing GenAI’s potential in their organizations.”

It’s hard enough to ensure that large databases full of relatively orderly structured data are current, complete, and accurate. Making sense of much bigger collections of heterogeneous, chaotic unstructured data that don’t conform to traditional standards and formats is the new frontier in data quality.

Why Anomalo is the right choice for addressing the issue

Anomalo excels at finding the unknown unknowns, the issues you never thought to look for. It’s an approach that’s worked great for structured data, and now we’re applying it to the unstructured kind.

Traditional data observability approaches are useful for an overview, like making sure that the right amount of data arrived at the expected time. But with today’s enterprises ingesting and processing enormous amounts of data from highly varied sources and using it to make decisions without human intervention, it’s only by looking into the data itself that you can grow confident in its integrity.

Anomalo has collaborated with some of the world’s most discerning data teams on scaling data quality issue detection, root cause analysis, and resolution. Our machine learning algorithms learn how your data normally looks, so they can detect when your data suddenly deviates from expectations. (Anomalo does all of this securely, too, either as a SOC 2-certified SaaS or directly within a VPC.)

Now, we’re applying what we know about deep data quality monitoring at scale to the enormous universe of unstructured data.

How to monitor unstructured data

Anomalo evaluates the quality of unstructured data in two ways. Appropriately enough, both approaches use AI to understand the nature and patterns of the data itself.

The first is ensuring the integrity of the data itself, the same perspective we apply to structured data. For instance, if your fine-tuning refreshes continually or on a cadence, you’ll want to know when new learning material isn’t showing up on time. Other important errors to identify include duplicate content, content that doesn’t match the metadata, and corrupted files.
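To make the integrity checks concrete, here is a minimal Python sketch that flags corrupted, empty, and duplicate documents in a small collection. It is an illustration only, not Anomalo’s implementation: the function names, the document IDs, and the choice to treat undecodable UTF-8 as "corrupted" and whitespace-insensitive hash matches as "duplicates" are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations hash the same."""
    return " ".join(text.lower().split())


def check_integrity(docs: dict[str, bytes]) -> dict[str, list]:
    """Flag undecodable, empty, and duplicate documents.

    `docs` maps a hypothetical document ID to the document's raw bytes.
    """
    report = {"corrupted": [], "empty": [], "duplicates": []}
    seen = defaultdict(list)  # content hash -> IDs of documents with that content

    for doc_id, raw in docs.items():
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: bytes that are not valid UTF-8 count as corrupted.
            report["corrupted"].append(doc_id)
            continue
        cleaned = normalize(text)
        if not cleaned:
            report["empty"].append(doc_id)
            continue
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        seen[digest].append(doc_id)

    # Any hash shared by more than one document marks a duplicate group.
    report["duplicates"] = [ids for ids in seen.values() if len(ids) > 1]
    return report


sample = {
    "a.txt": b"Thanks for contacting support.",
    "b.txt": b"thanks  for contacting   SUPPORT.",
    "c.txt": b"\xff\xfe\x00broken",
    "d.txt": b"   ",
}
print(check_integrity(sample))
```

Exact hashing only catches copies that match after normalization. Near-duplicates, metadata mismatches, and late-arriving data would each need their own checks.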

The second approach is detecting what data shouldn’t be used for training or fine-tuning. From abusive language to personally identifiable information, if you can keep your GenAI model from seeing it, it’s not going to influence or even show up in the output.
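As a rough illustration of keeping sensitive content away from a model, the Python sketch below separates documents that match simple PII patterns from those that don’t. The post describes an AI-based approach; the regular expressions here are a crude stand-in chosen for brevity, and the pattern names and the `split_for_training` helper are hypothetical.

```python
import re

# Deliberately simple, illustrative patterns. Real PII detection needs far
# broader coverage (names, addresses, account numbers, other locales).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> list[str]:
    """Return the names of the PII patterns that match the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


def split_for_training(
    docs: dict[str, str],
) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Separate documents that look clean from those flagged for review."""
    clean, flagged = {}, {}
    for doc_id, text in docs.items():
        hits = find_pii(text)
        if hits:
            flagged[doc_id] = hits  # keep the reasons so a reviewer can triage
        else:
            clean[doc_id] = text
    return clean, flagged


transcripts = {
    "t1": "The customer asked about the return policy.",
    "t2": "Please email me at jane.doe@example.com or call 555-123-4567.",
}
clean, flagged = split_for_training(transcripts)
print(sorted(clean))
print(flagged)
```

Recording which pattern fired, rather than silently dropping the document, keeps the decision reviewable: a flagged transcript can be redacted and returned to the training set instead of being discarded.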

Announcing the private beta preview for existing customers

If you agree that high-quality preprocessing data is important to your company’s GenAI efforts, we invite you to apply for the private beta for unstructured data monitoring.

This new feature evaluates document collections, such as transcripts, for data quality around various characteristics that go far beyond traditional data monitoring, including document length, duplicates, topics, tone, language, abusive language, PII, and sentiment.

This flagging greatly speeds up the manual evaluation of document collection quality. It identifies issues in individual documents, dramatically reducing the time needed to curate, profile, and leverage high-value unstructured text data.
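Document length is the simplest of these characteristics to profile by hand. The Python sketch below flags documents whose word count sits far from the collection’s median, using a modified z-score based on the median absolute deviation (MAD). This is one common outlier heuristic picked for illustration, not a description of how the beta feature works; the function name and the 3.5 threshold are assumptions.

```python
from statistics import median


def length_outliers(docs: dict[str, str], threshold: float = 3.5) -> list[str]:
    """Flag documents whose word count is far from the collection's median.

    Uses the modified z-score: 0.6745 * |count - median| / MAD.
    """
    counts = {doc_id: len(text.split()) for doc_id, text in docs.items()}
    med = median(counts.values())
    mad = median(abs(c - med) for c in counts.values())
    if mad == 0:
        # At least half the documents share the median length, so the score
        # is undefined; fall back to flagging anything that differs.
        return [doc_id for doc_id, c in counts.items() if c != med]
    return [
        doc_id
        for doc_id, c in counts.items()
        if 0.6745 * abs(c - med) / mad > threshold
    ]


sample = {
    "a": "word " * 100,
    "b": "word " * 110,
    "c": "word " * 95,
    "d": "word " * 105,
    "e": "hi there",  # suspiciously short, e.g. a truncated transcript
}
print(length_outliers(sample))
```

A median-based score is used here because a few extreme documents would distort a mean and standard deviation, which is exactly the situation the check is meant to catch.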

If you want to discover the benefits of data quality monitoring on unstructured data, apply for the private beta preview. We look forward to hearing from you.